Profiling Python Applications

The Three-Day Mistake

A senior engineer at a data company spent three days rewriting a text preprocessing function in Cython. The function cleaned and tokenised incoming customer support tickets before they were fed into a classifier. He had read that string operations were slow in Python and was convinced the tokeniser was the problem.

After the rewrite, he ran the service under load. The p99 latency dropped from 340ms to... 332ms.

Confused, he finally did what he should have done on day one: ran a profiler.

Profiling output (sorted by tottime):

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2000    0.003    0.000    0.003    0.000 tokeniser.pyx:12(tokenise)   ← his work
     2000    0.007    0.000    0.007    0.000 classifier.py:45(predict)
     2000    0.298    0.000    0.298    0.000 audit.py:89(log_ticket)      ← !!
     2000    0.012    0.000    0.012    0.000 response.py:23(format)

The log_ticket function was writing to a file using a regex to redact PII from every log line. The regex was compiled fresh on every call. The tokeniser he spent three days on? It consumed 0.3% of total runtime. The regex in the audit logger consumed 89%.

The fix: re.compile() at module level, called once. Time: 20 minutes.

This is not an unusual story. It is the default story for engineers who skip profiling.

What You Will Learn

Use cProfile and pstats to find CPU bottlenecks in any Python program
Use line_profiler to identify the exact slow line inside a function
Use py-spy to profile running production processes without modifying code
Use memory_profiler and tracemalloc to find memory bottlenecks and leaks
Use timeit correctly for micro-benchmarks
Build a profiling-first discipline that prevents the three-day mistake

Prerequisites

Requirement	Level Needed
Python functions and decorators	Comfortable
Command-line basics	Comfortable
pip package installation	Comfortable
asyncio basics (Lesson 05 only)	Helpful

The Python Performance Measurement Stack

Choose the right tool before reaching for any of them. Each tool has a specific use case and significant blind spots.

Tool	Type	Overhead	Use Case	Code Changes?
`timeit`	Timer	None	Micro-benchmarks for small code snippets	Yes
`cProfile`	Deterministic	Medium	Finding which functions are slow, call counts	Yes (or not)
`line_profiler`	Deterministic	High	Finding which lines inside a function are slow	Yes (decorator)
`py-spy`	Sampling	~1%	Profiling running processes, production safe	No
`pyinstrument`	Sampling	Low	Development profiling, beautiful output	Yes or no
`Austin`	Sampling	<1%	Ultra-low overhead, SVG flamegraphs	No
`memory_profiler`	Deterministic	Very high	Line-by-line memory usage	Yes (decorator)
`tracemalloc`	Deterministic	Medium	Allocation tracking, leak detection	Yes
`objgraph`	Snapshot	Low	Finding leaked object types and references	Yes

Decision rule:

Unknown bottleneck → cProfile first
Found the function, need the line → line_profiler
Running production process → py-spy
Memory growing → tracemalloc + objgraph
Memory by line → memory_profiler
Micro-benchmark → timeit

Section 1: `cProfile` and `pstats`

cProfile is the standard starting point. It instruments every function call - recording how many times each function was called and how long each call took. It has medium overhead (typically 2–5x slowdown) and is safe to use in development.

Basic Usage

import cProfile

def slow_function():
    """A realistic function with a hidden bottleneck."""
    import re
    results = []
    data = ["[email protected] ticket #1234 URGENT"] * 10_000

    for line in data:
        # Bug: compiling regex inside the loop - 10,000 compilations
        pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b')
        emails = pattern.findall(line)
        results.append(emails)
    return results

# Method 1: simple one-liner
cProfile.run('slow_function()')

Output:

         30006 function calls in 0.847 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.847    0.847 <string>:1(<module>)
    10000    0.031    0.000    0.031    0.000 re.py:232(compile)
    10000    0.782    0.000    0.782    0.000 re.py:289(_compile)  ← HERE
    10000    0.023    0.000    0.023    0.000 {method 'findall'}
        1    0.011    0.011    0.847    0.847 example.py:4(slow_function)

The column meanings:

Column	Meaning
`ncalls`	Number of times this function was called
`tottime`	Total time spent in this function, excluding time in sub-calls
`percall`	`tottime / ncalls`
`cumtime`	Total time spent in this function including all sub-calls
`percall`	`cumtime / ncalls` (second percall column)

Key insight: tottime is the most useful column. It tells you how much time this function itself consumed, without attributing time to functions it called. A function with high cumtime but low tottime is slow because of what it calls, not what it does itself.

`cProfile.Profile()` as Context Manager

import cProfile
import pstats
import io

def profile_block(sort_by: str = 'tottime', limit: int = 20):
    """Context manager for profiling a block of code."""
    import contextlib

    @contextlib.contextmanager
    def _profiler():
        pr = cProfile.Profile()
        pr.enable()
        try:
            yield pr
        finally:
            pr.disable()
            s = io.StringIO()
            ps = pstats.Stats(pr, stream=s)
            ps.strip_dirs()
            ps.sort_stats(sort_by)
            ps.print_stats(limit)
            print(s.getvalue())

    return _profiler()

# Usage
with profile_block(sort_by='tottime', limit=10):
    slow_function()

`pstats` for Filtering and Sorting

import cProfile
import pstats
import io

# Profile to a file
cProfile.run('slow_function()', 'profile_output.prof')

# Load and analyse the saved profile
with open('profile_output.prof', 'rb') as f:
    stats = pstats.Stats('profile_output.prof')

# Sort options: 'tottime', 'cumtime', 'ncalls', 'pcalls', 'name', 'filename'
stats.sort_stats('tottime')

# Print only functions that match a filename pattern
stats.print_stats('re.py')   # only regex-related functions
stats.print_stats(0.1)       # only top 10% of functions by count
stats.print_stats(20)        # top 20 functions

# Show callers of a specific function
stats.print_callers('_compile')

# Show what a function calls
stats.print_callees('slow_function')

`snakeviz` - Visual Flamegraphs

cProfile output is hard to read at scale. snakeviz renders the profile as an interactive sunburst or icicle chart in your browser:

pip install snakeviz

# Profile to a file first
python -m cProfile -o myapp.prof myapp.py

# Open the visual profile
snakeviz myapp.prof

For production profiling, save profiles per-request and attach them to slow-request logs:

import cProfile
import pstats
import io
import logging
from functools import wraps

logger = logging.getLogger(__name__)

def profile_if_slow(threshold_ms: float = 200.0):
    """
    Decorator: profile the function if it takes longer than threshold_ms.
    Logs the profile output without affecting normal execution.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            pr = cProfile.Profile()
            pr.enable()
            try:
                result = func(*args, **kwargs)
            finally:
                pr.disable()
                s = io.StringIO()
                ps = pstats.Stats(pr, stream=s)
                ps.strip_dirs()
                ps.sort_stats('tottime')
                ps.print_stats(15)
                profile_text = s.getvalue()

            elapsed_ms = sum(
                row[2] for row in pr.getstats()
                if row[0] == func.__code__
            ) * 1000

            if elapsed_ms > threshold_ms:
                logger.warning(
                    "Slow function detected: %s took %.1fms\n%s",
                    func.__name__, elapsed_ms, profile_text
                )
            return result
        return wrapper
    return decorator

@profile_if_slow(threshold_ms=100.0)
def process_document(doc_id: str) -> dict:
    # ... production code ...
    pass

Section 2: `line_profiler` - Finding the Slow Line

cProfile tells you which function is slow. line_profiler tells you which line inside that function is slow. This distinction matters enormously - a 100-line function might have its entire cost on line 47.

pip install line-profiler

Basic Usage

# Decorate the function you want to profile
# The @profile decorator is injected by kernprof - do NOT import it
@profile
def preprocess_batch(documents: list[dict]) -> list[dict]:
    """Realistic ML preprocessing pipeline."""
    results = []

    for doc in documents:
        # Step 1: lowercase and strip
        text = doc['content'].lower().strip()

        # Step 2: remove punctuation
        import string
        text = text.translate(str.maketrans('', '', string.punctuation))

        # Step 3: tokenise by whitespace
        tokens = text.split()

        # Step 4: remove stopwords (loaded fresh each time - the bug)
        stopwords = set(open('/usr/share/dict/stopwords').read().split())
        tokens = [t for t in tokens if t not in stopwords]

        # Step 5: build bigrams
        bigrams = [f"{tokens[i]}_{tokens[i+1]}"
                   for i in range(len(tokens)-1)]

        results.append({
            'id': doc['id'],
            'tokens': tokens,
            'bigrams': bigrams,
        })

    return results

Run with:

kernprof -l -v preprocess_example.py

Output:

Wrote profile results to preprocess_example.py.lprof

Timer unit: 1e-06 s

Total time: 8.421 s
File: preprocess_example.py
Function: preprocess_batch at line 4

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     4                                           @profile
     5                                           def preprocess_batch(documents):
     8      5000      12304.0      2.5      0.1      for doc in documents:
    11      5000      45231.0      9.0      0.5          text = doc['content'].lower().strip()
    14      5000      23187.0      4.6      0.3          text = text.translate(...)
    17      5000      18432.0      3.7      0.2          tokens = text.split()
    20      5000    7891234.0   1578.2     93.7          stopwords = set(open(...).read().split())
    21      5000      89234.0     17.8      1.1          tokens = [t for t in tokens...]
    24      5000      78234.0     15.6      0.9          bigrams = [...]

Line 20 consumes 93.7% of the total time. It opens and reads a file on every iteration of the loop. The fix:

# Load stopwords ONCE, outside the loop
STOPWORDS = set(open('/usr/share/dict/stopwords').read().split())

@profile
def preprocess_batch_fixed(documents: list[dict]) -> list[dict]:
    results = []
    for doc in documents:
        text = doc['content'].lower().strip()
        text = text.translate(str.maketrans('', '', string.punctuation))
        tokens = text.split()
        tokens = [t for t in tokens if t not in STOPWORDS]  # uses module-level set
        bigrams = [f"{tokens[i]}_{tokens[i+1]}" for i in range(len(tokens)-1)]
        results.append({'id': doc['id'], 'tokens': tokens, 'bigrams': bigrams})
    return results

After the fix: 0.131 s. 64x speedup, from a one-line change that required no Cython, no Numba, no C extensions - just moving file I/O out of the loop.

Programmatic Usage

from line_profiler import LineProfiler

def add_profile(func):
    """Add line-level profiling to a function programmatically."""
    lp = LineProfiler()
    lp_wrapper = lp(func)

    def run(*args, **kwargs):
        result = lp_wrapper(*args, **kwargs)
        lp.print_stats()
        return result

    return run

Section 3: `py-spy` - The Production-Safe Profiler

cProfile and line_profiler require code changes. They also add significant overhead - 2–5x for cProfile, up to 50x for line_profiler. You cannot use them on a production server under load without impacting users.

py-spy is a sampling profiler. It attaches to a running Python process, reads its call stack at regular intervals (100 times per second by default), and produces a statistical profile. Overhead is approximately 1%. No code changes required.

pip install py-spy
# or: cargo install py-spy (if you have Rust)

Profiling a Running Process

# Find your process ID
ps aux | grep python

# Show a live top-like view (updates every second)
py-spy top --pid 12345

# Record a flamegraph SVG
py-spy record --output profile.svg --pid 12345 --duration 30

# Dump the current call stack (one snapshot)
py-spy dump --pid 12345

The py-spy top output looks like:

Collecting samples from 'python app.py' (pid: 12345)
Total Samples 1200
GIL: 34.00%, Active: 78.00%, Threads: 4

  %Own   %Total  OwnTime  TotalTime  Function (filename)
  23.00  23.00     2.76s    2.76s   log_ticket (audit.py:89)
  18.00  18.00     2.16s    2.16s   _compile (re.py:289)
  12.00  35.00     1.44s    4.20s   handle_request (api.py:145)
   9.00   9.00     1.08s    1.08s   json_loads (json/decoder.py:337)

Profiling a FastAPI Service Under Load

Typical workflow for profiling a production-like service:

# Terminal 1: start the service
uvicorn myapp:app --workers 1 --port 8000

# Terminal 2: generate load
hey -n 10000 -c 50 http://localhost:8000/api/process

# Terminal 3: while load is running, profile it
py-spy record --output flamegraph.svg \
    --pid $(pgrep -f "uvicorn myapp") \
    --duration 20 \
    --rate 100

# Open the flamegraph
open flamegraph.svg

Reading Flamegraphs

A flamegraph is a stack trace visualisation:

Width  = time spent in function (wider = slower)
Height = call depth (bottom = entry point, top = leaf)
Colour = arbitrary (no performance meaning)

                  ┌─ regex._compile ─────────────────┐
        ┌─ audit.log_ticket ────────────────────────────────────┐
┌─ fastapi.handle ─────────────────────────────────────────────────────────┐
│████████████████████████████████████████████████████████████████████████│

Look for the widest bars at the top of the stack - those are the leaf functions where the most time is being spent.

Profiling a Kubernetes Pod

# Find the pod
kubectl get pods -l app=myservice

# Exec into the pod and run py-spy
kubectl exec -it myservice-pod-abc123 -- \
    py-spy record \
    --output /tmp/profile.svg \
    --pid 1 \
    --duration 30 \
    --subprocesses

# Copy the profile out
kubectl cp myservice-pod-abc123:/tmp/profile.svg ./profile.svg

Note: py-spy requires the SYS_PTRACE capability in Kubernetes. Add to your pod spec:

securityContext:
  capabilities:
    add:
      - SYS_PTRACE

Why Sampling Profilers Are Safer

Deterministic profilers (like cProfile) record every function call. They cannot miss any call, but they impose consistent overhead throughout the profiling session. At high request rates this overhead can cause request timeouts and queue buildup.

Sampling profilers read the call stack at intervals. They can miss functions that complete between samples (typically fast utility functions), but they accurately represent where time is spent - which is what you want for finding bottlenecks. The statistical error is bounded and predictable.

For production use, always prefer a sampling profiler.

Section 4: `memory_profiler` - Line-by-Line Memory

CPU profiling finds slow functions. Memory profiling finds functions that allocate too much. These are different problems with different fixes.

pip install memory-profiler matplotlib

Basic Usage

from memory_profiler import profile

@profile
def build_inverted_index(documents: list[dict]) -> dict:
    """Build an inverted index - a common memory-intensive operation."""
    # Step 1: collect all tokens
    all_tokens = []                          # potential memory issue
    for doc in documents:
        tokens = doc['content'].split()
        all_tokens.extend(tokens)            # extends by copying

    # Step 2: build the index
    index = {}
    for doc_id, doc in enumerate(documents):
        tokens = set(doc['content'].split())
        for token in tokens:
            if token not in index:
                index[token] = []
            index[token].append(doc_id)

    return index

Run with:

python -m memory_profiler memory_example.py

Output:

Filename: memory_example.py

Line #    Mem usage    Increment   Occurrences   Line Contents
=============================================================
 52.3 MiB    52.3 MiB           1   @profile
                                     def build_inverted_index(documents):
 52.3 MiB     0.0 MiB           1       all_tokens = []
 52.3 MiB     0.0 MiB       50000       for doc in documents:
 52.3 MiB     0.0 MiB       50000           tokens = doc['content'].split()
412.7 MiB   360.4 MiB       50000           all_tokens.extend(tokens)  ← !!
413.1 MiB     0.4 MiB           1       index = {}
448.2 MiB    35.1 MiB       50000       for doc_id, doc in enumerate(documents):

Line 11 allocates 360 MB because all_tokens accumulates a copy of every token from every document. If you only need the index (not the flat list), all_tokens is unnecessary entirely.

Fixed version:

@profile
def build_inverted_index_fixed(documents: list[dict]) -> dict:
    """Build inverted index without the intermediate flat list."""
    index = {}
    for doc_id, doc in enumerate(documents):
        for token in set(doc['content'].split()):
            index.setdefault(token, []).append(doc_id)
    return index

Memory output for fixed version:

Line #    Mem usage    Increment   Line Contents
=================================================
 52.3 MiB    52.3 MiB   @profile
                         def build_inverted_index_fixed(documents):
 52.3 MiB     0.0 MiB       index = {}
 52.3 MiB     0.0 MiB       for doc_id, doc in enumerate(documents):
 87.4 MiB    35.1 MiB           for token in set(doc['content'].split()):
 87.4 MiB     0.0 MiB               index.setdefault(token, []).append(doc_id)
 87.4 MiB     0.0 MiB       return index

Peak usage: 87 MB instead of 448 MB. A 5x reduction from a 3-line change.

`mprof` - Memory Over Time

# Record memory usage over time
mprof run python memory_example.py

# Plot it
mprof plot

# Plot the most recent run
mprof plot --output memory_profile.png

This produces a time-series plot showing memory usage. Useful for identifying memory leaks in long-running processes: if memory grows monotonically over time, you have a leak.

Section 5: `tracemalloc` - Allocation Tracking Without Dependencies

tracemalloc is built into Python's standard library (Python 3.4+). It requires no external packages and is the right tool when you cannot install third-party packages (restricted environments, containers without pip access).

import tracemalloc
import linecache

def display_top_allocations(snapshot, limit: int = 10) -> None:
    """Display top memory-allocating lines from a tracemalloc snapshot."""
    top_stats = snapshot.statistics('lineno')

    print(f"\nTop {limit} memory allocations:")
    print("=" * 70)
    for index, stat in enumerate(top_stats[:limit], 1):
        frame = stat.traceback[0]
        print(f"\n#{index}: {frame.filename}:{frame.lineno}")

        # Show the actual source line
        line = linecache.getline(frame.filename, frame.lineno).strip()
        if line:
            print(f"    {line}")
        print(f"    {stat.size / 1024:.1f} KiB - {stat.count} allocations")


def find_memory_bottleneck():
    """Demonstrate tracemalloc for finding top allocations."""
    tracemalloc.start(25)  # keep 25 frames of stack trace

    # === Your code under test ===
    import json
    large_data = [{"id": i, "value": "x" * 1000} for i in range(10_000)]
    serialised = json.dumps(large_data)
    parsed = json.loads(serialised)
    # === End code under test ===

    snapshot = tracemalloc.take_snapshot()
    tracemalloc.stop()

    display_top_allocations(snapshot, limit=10)


find_memory_bottleneck()

Snapshot Diffing - Finding Leaks in Long-Running Processes

The most powerful tracemalloc technique: take two snapshots separated in time and look at what grew. This isolates memory that was allocated but never freed - i.e., leaks.

import tracemalloc
import time
import gc
from typing import Optional

class LeakDetector:
    """
    Production-grade leak detector using tracemalloc snapshots.
    Take a baseline snapshot, then compare against it periodically.
    """

    def __init__(self, n_frames: int = 25):
        self.n_frames = n_frames
        self._baseline: Optional[tracemalloc.Snapshot] = None

    def start(self) -> None:
        tracemalloc.start(self.n_frames)

    def take_baseline(self) -> None:
        """Call this after the process has warmed up."""
        gc.collect()
        self._baseline = tracemalloc.take_snapshot()
        print("Baseline snapshot taken.")

    def report_growth(self, top_n: int = 15) -> None:
        """Compare current state to baseline, show what grew."""
        if self._baseline is None:
            raise RuntimeError("Call take_baseline() first.")

        gc.collect()
        current = tracemalloc.take_snapshot()
        stats = current.compare_to(self._baseline, 'lineno')

        print(f"\nMemory growth since baseline (top {top_n}):")
        print("=" * 70)
        for stat in stats[:top_n]:
            if stat.size_diff > 0:  # only show growth, not shrinkage
                frame = stat.traceback[0]
                print(
                    f"{frame.filename}:{frame.lineno} "
                    f"+{stat.size_diff / 1024:.1f} KiB "
                    f"({stat.count_diff:+d} objects)"
                )


# Example: detect a leak in a request handler
detector = LeakDetector()
detector.start()

# Simulate service warm-up
for _ in range(100):
    process_request_warmup()

detector.take_baseline()

# Simulate production traffic
for i in range(1000):
    process_request()
    if i % 100 == 0:
        detector.report_growth(top_n=5)

Finding Where an Object Was Allocated

import tracemalloc

tracemalloc.start(50)  # deep stack traces

# ... run code ...

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('traceback')

# Show full stack trace for the largest single allocation
stat = top_stats[0]
print(f"Largest allocation: {stat.size / 1024:.1f} KiB")
print("Stack trace:")
for line in stat.traceback.format():
    print(f"  {line}")

Section 6: `timeit` - Correct Micro-Benchmarks

timeit is the right tool for comparing small code snippets. It is routinely misused - here is how to use it correctly.

The Wrong Way

import time

# WRONG: single run, includes interpreter startup,
# does not account for JIT warming, not reproducible
start = time.time()
result = sum(x**2 for x in range(100_000))
print(f"Took: {time.time() - start:.6f}s")

Problems:

Single measurement - high variance from OS scheduling, cache state
time.time() has lower resolution than time.perf_counter()
Does not account for Python startup cost
Does not separate setup from the thing being measured

The Right Way

import timeit
import statistics

def honest_benchmark(
    stmt: str,
    setup: str = "pass",
    n_runs: int = 7,
    loops_per_run: int = 10_000,
    globals_dict: dict = None,
) -> dict:
    """
    Honest benchmark following best practices:
    - Multiple independent runs (n_runs)
    - Multiple loops per run (loops_per_run)
    - Reports median, not mean
    - Separates setup from the timed statement
    """
    timer = timeit.Timer(stmt=stmt, setup=setup, globals=globals_dict)

    # timeit returns total time for all loops in one run
    raw_times = timer.repeat(repeat=n_runs, number=loops_per_run)

    # Convert to per-iteration time
    per_iter = [t / loops_per_run for t in raw_times]

    return {
        "median_ns": statistics.median(per_iter) * 1e9,
        "mean_ns":   statistics.mean(per_iter) * 1e9,
        "stdev_ns":  statistics.stdev(per_iter) * 1e9,
        "min_ns":    min(per_iter) * 1e9,
        "max_ns":    max(per_iter) * 1e9,
    }


# Compare two approaches for building a string
result = honest_benchmark(
    stmt='"".join(parts)',
    setup='parts = ["word"] * 1000',
    n_runs=10,
    loops_per_run=50_000,
)
print(f"join: {result['median_ns']:.1f} ns/iter (±{result['stdev_ns']:.1f})")

result = honest_benchmark(
    stmt='s = ""; [s := s + p for p in parts]',  # quadratic string concat
    setup='parts = ["word"] * 1000',
    n_runs=10,
    loops_per_run=1_000,  # fewer loops - this is much slower
)
print(f"concat: {result['median_ns']:.1f} ns/iter (±{result['stdev_ns']:.1f})")

`%timeit` in Jupyter

In a Jupyter notebook, %timeit is the cleanest interface:

import re

text = "Contact us at [email protected] or [email protected] for help"

# Compile regex once (should be faster)
pattern = re.compile(r'\b[\w.+-]+@[\w-]+\.[a-z]{2,}\b')

%timeit re.findall(r'\b[\w.+-]+@[\w-]+\.[a-z]{2,}\b', text)
# Output: 3.21 µs ± 82.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

%timeit pattern.findall(text)
# Output: 1.87 µs ± 34.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

%%timeit (double %) times an entire cell.

Common `timeit` Mistakes

Mistake	Problem	Fix
`n=1` (single run)	High variance, unreliable	Use `repeat=7, number=10000`
Including setup in the stmt	Measuring setup, not the function	Use `setup=` parameter
Using wall time with `time.time()`	OS scheduling noise included	Use `time.perf_counter()`
Reporting mean instead of median	Outliers skew mean	Report median
Not importing in setup	Import overhead included in measurement	Put imports in `setup=`
Comparing different data sizes	Not an apples-to-apples comparison	Hold data size constant

Section 7: Profiling in Production

Development profiling and production profiling are different disciplines.

Continuous Profiling with `py-spy`

Run py-spy as a sidecar or init container alongside your application:

# Dockerfile - add py-spy to the image
FROM python:3.12-slim

RUN pip install py-spy

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . /app
WORKDIR /app

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

# kubernetes-sidecar.yaml - py-spy as a sidecar
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: app
    image: myapp:latest
    securityContext:
      capabilities:
        add: ["SYS_PTRACE"]

  - name: profiler
    image: myapp:latest  # same image has py-spy
    command:
    - py-spy
    - record
    - --output
    - /profiles/$(date +%s).svg
    - --pid
    - "1"
    - --duration
    - "60"
    - --rate
    - "50"
    volumeMounts:
    - name: profiles
      mountPath: /profiles

`pyinstrument` for Development

pyinstrument is a sampling profiler with beautiful terminal output. Better than cProfile for quick development profiling:

pip install pyinstrument

# Profile a script
pyinstrument my_script.py

# Profile a specific function in code

from pyinstrument import Profiler

profiler = Profiler()
profiler.start()

# ... your code ...
result = process_documents(docs)

profiler.stop()
profiler.print()   # beautiful terminal output
profiler.open_in_browser()  # interactive HTML report

Pyroscope - Always-On Profiling

For always-on production profiling, pyroscope pushes profiles to a central server continuously:

import pyroscope

pyroscope.configure(
    application_name="myservice",
    server_address="http://pyroscope:4040",
    tags={
        "env": "production",
        "version": os.environ.get("APP_VERSION", "unknown"),
    },
)

This gives you 24/7 flamegraph visibility without manually triggering profiling sessions.

Section 8: The Profiling Discipline - A Workflow

Combine all the tools into a repeatable workflow:

Step 1: CPU Profile (cProfile)
─────────────────────────────
python -m cProfile -o profile.prof -s tottime myapp.py
snakeviz profile.prof

  → Find the top 3 functions by tottime
  → Note their ncalls and cumtime

Step 2: Line Profile the Hot Function (line_profiler)
──────────────────────────────────────────────────────
Add @profile to the top 3 functions
kernprof -l -v myapp.py

  → Find the exact line responsible
  → Form a hypothesis about WHY it is slow

Step 3: Memory Profile if RSS is High (tracemalloc)
────────────────────────────────────────────────────
Add tracemalloc.start() at entry point
Take snapshot before and after the operation
compare_to() to find growth

  → Find which allocations are growing
  → Check if they are released (GC'd) or retained (leak)

Step 4: Benchmark the Fix (timeit)
────────────────────────────────────
Before changing anything: timeit the slow line
After the fix: timeit the new version
Report speedup with median and stdev

Step 5: Verify Under Production Load (py-spy)
──────────────────────────────────────────────
Deploy to staging
Run load test: hey -n 50000 -c 100 http://staging/endpoint
py-spy record --pid $(pgrep myapp) --duration 30

  → Confirm the previously hot function is now fast
  → Check no new bottleneck appeared

Red-Yellow-Green Triage

When reviewing a cProfile output, triage functions into three buckets:

Colour	Criterion	Action
Red	`tottime` > 10% of total runtime	Profile with `line_profiler`, fix now
Yellow	`tottime` 2–10% of total runtime, `ncalls` > 10k	Investigate - may be worth fixing
Green	`tottime` < 2% of total runtime	Ignore completely

Most programs have 1–3 Red functions. Fix those. Ignore everything else.

Interview Questions

Q1: What is the difference between tottime and cumtime in cProfile output? Which is more useful for identifying bottlenecks?

tottime is the total time spent executing the function's own code, excluding the time spent in functions it called. cumtime is the total time including all recursive calls to sub-functions.

tottime is almost always more useful for bottleneck identification. A function with high cumtime but low tottime is slow because of what it calls - not what it does itself. To fix that function you need to look at its callees. A function with high tottime is itself the bottleneck and is where you direct line_profiler. The exception is when you want to find the highest-level entry point into a slow call tree - then cumtime is the right column to sort by.

Q2: Why is py-spy safer to use in production than cProfile? What are its limitations?

py-spy is a sampling profiler - it reads the process's call stack at intervals (100 samples/second by default) without modifying the process. It attaches via ptrace (Linux) or OS-specific process inspection APIs. Overhead is approximately 1%.

cProfile is a deterministic profiler that instruments every function entry and exit via Python's sys.setprofile() hook. This adds overhead to every function call - typically 2–5x. At high request rates (e.g., 1000 req/s), this overhead can push response times over timeout thresholds and cause cascading failures.

Limitations of py-spy:

Cannot time functions that are too fast to be caught by sampling (< 1ms duration, rarely called)
Requires SYS_PTRACE capability in containerised environments
Cannot be used for micro-benchmarking (not deterministic enough)
Line-level precision is limited compared to line_profiler

Q3: You suspect your FastAPI service has a memory leak. Describe exactly how you would find it using tracemalloc.

First I would enable tracemalloc at application startup with a sufficient frame depth: tracemalloc.start(25). Then I would let the service warm up (handle 100–200 requests to fill caches and pre-allocate common objects). After warm-up I call gc.collect() and take a baseline snapshot: baseline = tracemalloc.take_snapshot().

I then let the service run for 10–30 minutes under normal traffic. After the interval, I call gc.collect() again and take a second snapshot. I compare the two: stats = current.compare_to(baseline, 'lineno'), and filter for stat.size_diff > 0 (only allocations that grew). The top entries by size_diff are the candidates for the leak.

I then look at the traceback attribute of the top stat to get the full stack trace showing where the allocation originated. This is usually sufficient to identify whether it is a growing cache, an unreleased callback, a circular reference preventing GC, or a container being appended to without being cleared.

Q4: What is a flamegraph and how do you read one? What does it tell you that cProfile output does not?

A flamegraph is a 2D visualisation of call stacks sampled over time. The x-axis represents time (or fraction of samples), and width of each bar represents how much time was spent in that function. The y-axis represents stack depth - the bottom bar is the entry point, each row above shows functions called by the row below. Colour is arbitrary.

To read it: look for wide bars near the top of the stack. Those are leaf functions (no more callees) where actual time is being spent. A wide bar in the middle of the stack means a function spends most of its time in a single callee - follow the stack upward to find the real work.

What flamegraphs show that cProfile does not:

Context: you see which call path leads to a slow function - e.g., handle_request → validate_input → regex_compile vs handle_request → load_user → regex_compile. Both show up as regex_compile in cProfile's flat table, but the flamegraph shows they are in different code paths.
Proportion: the visual width immediately communicates relative cost without scanning columns.
Real-time observation: generated from a sampling profiler attached to a live process, not a special profiling run.

Q5: An engineer says "I profiled my service and found that json.loads consumes 40% of CPU. Should I rewrite the JSON parsing in C?"

Not as the first step. There are several things to investigate before rewriting anything:

Can you reduce the amount of JSON being parsed? If you are parsing the same document multiple times per request (e.g., parsing a config file on every request instead of once at startup), eliminate the redundant calls first.
Can you use a faster JSON library? orjson is a Rust-based JSON library that is 3–10x faster than stdlib json for most workloads. It is a one-line change: import orjson; data = orjson.loads(raw). msgspec is another option. Test both - the right answer depends on your data shape.
Can you reduce the payload size? If you are parsing large JSON objects but only using 10% of the fields, consider a streaming parser or a schema-aware deserialiser that skips unused fields.
Can you cache the parsed result? If the same JSON content is parsed multiple times (common with configuration, schema objects, or repeated API responses), cache with functools.lru_cache or a TTL cache.

Only if none of these approaches is available should you consider writing a C extension - and at that point, orjson already is a C/Rust extension, so "rewriting in C" likely means investing in a domain-specific binary format like msgpack, protobuf, or flatbuffers instead of JSON.

The lesson: a 40% cost in json.loads is almost always addressable without writing C code.

The Three-Day Mistake​

What You Will Learn​

Prerequisites​

The Python Performance Measurement Stack​

Section 1: cProfile and pstats​

Basic Usage​

cProfile.Profile() as Context Manager​

pstats for Filtering and Sorting​

snakeviz - Visual Flamegraphs​

Section 2: line_profiler - Finding the Slow Line​

Basic Usage​

Programmatic Usage​

Section 3: py-spy - The Production-Safe Profiler​

Profiling a Running Process​

Profiling a FastAPI Service Under Load​

Reading Flamegraphs​

Profiling a Kubernetes Pod​

Why Sampling Profilers Are Safer​

Section 4: memory_profiler - Line-by-Line Memory​

Basic Usage​

mprof - Memory Over Time​

Section 5: tracemalloc - Allocation Tracking Without Dependencies​

Snapshot Diffing - Finding Leaks in Long-Running Processes​

Finding Where an Object Was Allocated​

Section 6: timeit - Correct Micro-Benchmarks​

The Wrong Way​

The Right Way​

%timeit in Jupyter​

Common timeit Mistakes​

Section 7: Profiling in Production​

Continuous Profiling with py-spy​

pyinstrument for Development​

Pyroscope - Always-On Profiling​

Section 8: The Profiling Discipline - A Workflow​

Red-Yellow-Green Triage​

Interview Questions​

The Three-Day Mistake

What You Will Learn

Prerequisites

The Python Performance Measurement Stack

Section 1: `cProfile` and `pstats`

Basic Usage

`cProfile.Profile()` as Context Manager

`pstats` for Filtering and Sorting

`snakeviz` - Visual Flamegraphs

Section 2: `line_profiler` - Finding the Slow Line

Basic Usage

Programmatic Usage

Section 3: `py-spy` - The Production-Safe Profiler

Profiling a Running Process

Profiling a FastAPI Service Under Load

Reading Flamegraphs

Profiling a Kubernetes Pod

Why Sampling Profilers Are Safer

Section 4: `memory_profiler` - Line-by-Line Memory

Basic Usage

`mprof` - Memory Over Time

Section 5: `tracemalloc` - Allocation Tracking Without Dependencies

Snapshot Diffing - Finding Leaks in Long-Running Processes

Finding Where an Object Was Allocated

Section 6: `timeit` - Correct Micro-Benchmarks

The Wrong Way

The Right Way

`%timeit` in Jupyter

Common `timeit` Mistakes

Section 7: Profiling in Production

Continuous Profiling with `py-spy`

`pyinstrument` for Development

Pyroscope - Always-On Profiling

Section 8: The Profiling Discipline - A Workflow

Red-Yellow-Green Triage

Interview Questions